HW 5

Author

Natalie Poupart

Introduction

In this notebook, I combine material from all five class sessions into a complete analysis of a data set. I use the “sleep duration” data set*, which records teenagers’ nightly sleep duration together with whether they have a TV in their bedroom and whether they use a smartphone before bed.

*This data set was received in PUBH 6862

Read in libraries

library(readr)
library(psych)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(plotly)
library(vcd)
library(epitools)
library(lmtest)

Read in data

# read_csv() resolves the file name relative to the current working directory
sleep <- readr::read_csv("sleep_duration.csv")

Data Dictionary

We have \(4\) variables, one in each column of the data set.

The properties of the variables are tabled below.

| Variable | Explanation | Units or coding |
|---|---|---|
| \(\texttt{sleep}\) | Hours of sleep per night | hours |
| \(\texttt{age}\) | Age of participant | years |
| \(\texttt{tv}\) | TV in bedroom | 0 = no, 1 = yes |
| \(\texttt{smartphone}\) | Smartphone use before bed | 0 = no, 1 = yes |

Explore data

head(sleep)
# A tibble: 6 × 4
  sleep    tv   age smartphone
  <dbl> <dbl> <dbl>      <dbl>
1   7.8     1    18          0
2   8.6     0    18          1
3   8.7     1    18          0
4   7.2     1    13          1
5   6.3     1    18          1
6   7.6     0    18          1
summary(sleep)
     sleep              tv            age          smartphone  
 Min.   : 5.200   Min.   :0.00   Min.   :11.00   Min.   :0.00  
 1st Qu.: 7.100   1st Qu.:0.00   1st Qu.:14.00   1st Qu.:0.00  
 Median : 7.800   Median :0.00   Median :15.50   Median :0.00  
 Mean   : 7.884   Mean   :0.42   Mean   :15.50   Mean   :0.36  
 3rd Qu.: 8.700   3rd Qu.:1.00   3rd Qu.:17.75   3rd Qu.:1.00  
 Max.   :10.800   Max.   :1.00   Max.   :19.00   Max.   :1.00  

Summary Statistics

describe(sleep)
           vars  n  mean   sd median trimmed  mad  min  max range  skew
sleep         1 50  7.88 1.27    7.8    7.88 1.26  5.2 10.8   5.6  0.14
tv            2 50  0.42 0.50    0.0    0.40 0.00  0.0  1.0   1.0  0.31
age           3 50 15.50 2.11   15.5   15.55 2.22 11.0 19.0   8.0 -0.07
smartphone    4 50  0.36 0.48    0.0    0.32 0.00  0.0  1.0   1.0  0.57
           kurtosis   se
sleep         -0.35 0.18
tv            -1.94 0.07
age           -1.22 0.30
smartphone    -1.71 0.07

Research questions

Primary research question

Are \(\texttt{tv}\) and \(\texttt{smartphone}\) associated with \(\texttt{sleep}\)?

Describe

psych::describe(sleep$sleep)
   vars  n mean   sd median trimmed  mad min  max range skew kurtosis   se
X1    1 50 7.88 1.27    7.8    7.88 1.26 5.2 10.8   5.6 0.14    -0.35 0.18

The mean number of hours of sleep per night is 7.88, with a standard deviation of 1.27. The median is 7.8 hours per night. The minimum is 5.2 hours and the maximum is 10.8 hours.

Update labels of categorical variables

sleep <- sleep %>%
  mutate(
    tv = factor(tv, levels = c(0, 1), labels = c("No", "Yes")),
    smartphone = factor(smartphone, levels = c(0, 1), labels = c("No", "Yes"))
  )
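As a small illustration of what this recoding does, the sketch below applies the same factor() call to a made-up vector of 0/1 codes. The vector tv_codes is hypothetical and is not part of the data set.

```r
# Hypothetical 0/1 codes, standing in for a column such as tv
tv_codes <- c(1, 0, 1, 1, 0)

# Map 0 -> "No" and 1 -> "Yes"; the order of `levels` fixes the level order
tv_factor <- factor(tv_codes, levels = c(0, 1), labels = c("No", "Yes"))

levels(tv_factor)  # level order: "No" first, then "Yes"
table(tv_factor)   # counts per label
```

Because "No" is the first level, functions that take a formula, such as t.test(), report the "No" group first.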

Descriptive stats for sleep by the 2 groups

psych::describeBy(
  sleep$sleep,
  group = sleep$tv
)

 Descriptive statistics by group 
group: No
   vars  n mean   sd median trimmed  mad min  max range skew kurtosis   se
X1    1 29 8.32 1.16    8.5    8.26 1.19 6.4 10.8   4.4 0.36    -0.68 0.22
------------------------------------------------------------ 
group: Yes
   vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 21 7.28 1.18    7.2    7.28 1.19 5.2 9.5   4.3 0.02    -0.92 0.26
psych::describeBy(
  sleep$sleep,
  group = sleep$smartphone
)

 Descriptive statistics by group 
group: No
   vars  n mean   sd median trimmed  mad min  max range skew kurtosis   se
X1    1 32 8.07 1.25    8.2     8.1 1.33 5.2 10.8   5.6 -0.2    -0.48 0.22
------------------------------------------------------------ 
group: Yes
   vars  n mean   sd median trimmed  mad min  max range skew kurtosis  se
X1    1 18 7.54 1.26   7.25    7.47 0.89 5.4 10.8   5.4 0.74     0.54 0.3

Data Visualization

sleep_tv <- (sleep %>% ggplot2::ggplot(
  aes(
    x = tv,
    y = sleep
  )
) +
  ggplot2::geom_boxplot(
    aes(
      fill = tv
    ),
    show.legend = FALSE
  ) +
  ggplot2::labs(
    title = "Distribution of Sleep Hours",
    subtitle = "Comparison between TV and no TV in room"
  ) +
  ggplot2::scale_fill_manual(
    values = c("Yes" = "#17becf", "No" = "#e377c2")
  ) +
  ggplot2::xlab("TV in Room") +
  ggplot2::ylab("Sleep Duration") +
  ggthemes::theme_clean());

plotly::ggplotly(sleep_tv)
sleep_phone <- (sleep %>% ggplot2::ggplot(
  aes(
    x = smartphone,
    y = sleep
  )
) +
  ggplot2::geom_boxplot(
    aes(
      fill = smartphone
    ),
    show.legend = FALSE
  ) +
  ggplot2::labs(
    title = "Distribution of Sleep Hours",
    subtitle = "Comparison between Phone Use Before Bed and No Phone Use Before Bed"
  ) +
  ggplot2::scale_fill_manual(
    values = c("Yes" = "#17becf", "No" = "#e377c2")
  ) +
  ggplot2::xlab("Phone Use Before Bed") +
  ggplot2::ylab("Sleep Duration") +
  ggthemes::theme_clean());

plotly::ggplotly(sleep_phone)
vcd::mosaic(
  ~ tv + smartphone,
  data = sleep,
  highlighting = "tv",
  highlighting_fill = c("skyblue", "lightyellow"),
  main = "Mosaic plot of binary TV and Smartphone"
)

Inferential Statistics

A continuous variable can be compared between two groups with a t test, provided the assumptions for parametric tests are met.

The assumptions are (1) normality and (2) equality of variances.

Normality

Here, we will use the Shapiro-Wilk test to determine if the continuous variable is from a population in which the values are normally distributed. Under the null hypothesis for this test, the variable can be described by a normal distribution. We will set a level of significance of \(\alpha=0.05\) throughout.

Below, we use the filter and select verbs from the dplyr library and the pull function, which returns a vector of values. We use a pipeline to pass the vector to the shapiro.test function. We start with the sleep duration of those without a TV in their room.

sleep %>% 
  filter(tv == "No") %>% 
  select(sleep) %>% 
  pull() %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.96206, p-value = 0.369

Next, the sleep duration of those with a TV in their room.

sleep %>% 
  filter(tv == "Yes") %>% 
  select(sleep) %>% 
  pull() %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.98242, p-value = 0.9557

In both cases, we fail to reject the null hypothesis. The data give no evidence against normality, so we proceed under the assumption that sleep duration is normally distributed in each group.
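The two group-wise checks above can also be run in one pass. The sketch below is one possible way to do this with split() and sapply() from base R. It uses simulated data (the data frame toy and its values are made up for illustration), so its p-values are not those of the sleep data.

```r
# Simulated stand-in for the sleep data: 29 "No" and 21 "Yes" observations
set.seed(1)
toy <- data.frame(
  sleep = c(rnorm(29, mean = 8.3, sd = 1.2), rnorm(21, mean = 7.3, sd = 1.2)),
  tv    = factor(rep(c("No", "Yes"), times = c(29, 21)))
)

# Split sleep duration by group, run Shapiro-Wilk on each piece,
# and keep only the p-value from each test
p_values <- sapply(
  split(toy$sleep, toy$tv),
  function(x) shapiro.test(x)$p.value
)

p_values  # named vector with one p-value per group
```

The same call works for smartphone by swapping the grouping column.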

Now we will check the normality of sleep duration by smartphone use before bed, first for those who do not use a smartphone and then for those who do.

sleep %>% 
  filter(smartphone == "No") %>% 
  select(sleep) %>% 
  pull() %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.98243, p-value = 0.8662
sleep %>% 
  filter(smartphone == "Yes") %>% 
  select(sleep) %>% 
  pull() %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.94375, p-value = 0.3355

Again, in both cases, we fail to reject the null hypothesis. The data give no evidence against normality, so we proceed under the assumption that sleep duration is normally distributed in each group.

Equality of Variances

The other test that we will perform is Bartlett’s test. The null hypothesis is that the variance of the continuous variable is equal in the two groups. The bartlett.test function performs this test.

First, we will test TV in room.

bartlett.test(
  sleep$sleep, 
  sleep$tv 
)

    Bartlett test of homogeneity of variances

data:  sleep$sleep and sleep$tv
Bartlett's K-squared = 0.0058349, df = 1, p-value = 0.9391

Next, we will test smartphone use before bed.

bartlett.test(
  sleep$sleep, 
  sleep$smartphone 
)

    Bartlett test of homogeneity of variances

data:  sleep$sleep and sleep$smartphone
Bartlett's K-squared = 0.00064119, df = 1, p-value = 0.9798

In both cases, we fail to reject the null hypothesis of equal variances, so we can use an equal-variance t test. This test is performed using the t.test function.
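To make the comparison concrete, the sketch below prints the two group variances that Bartlett’s test compares and then runs the test through the formula interface of bartlett.test, which is an alternative to passing two vectors. The data are simulated (toy is a made-up data frame), so the numbers will differ from the output above.

```r
# Simulated stand-in for the sleep data, with equal spread in both groups
set.seed(42)
toy <- data.frame(
  sleep = c(rnorm(29, mean = 8.3, sd = 1.2), rnorm(21, mean = 7.3, sd = 1.2)),
  tv    = factor(rep(c("No", "Yes"), times = c(29, 21)))
)

# Sample variance of sleep within each group
group_vars <- tapply(toy$sleep, toy$tv, var)
group_vars

# Bartlett's test via the formula interface: outcome ~ grouping factor
bt <- bartlett.test(sleep ~ tv, data = toy)
bt$parameter  # degrees of freedom: number of groups minus 1
bt$p.value
```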

Statistical Analysis

Is there a difference in average sleep duration between those with and without a TV in their bedroom?

t.test(
  formula = sleep ~ tv,
  data = sleep,
  alternative = "two.sided", 
  var.equal = TRUE 
)

    Two Sample t-test

data:  sleep by tv
t = 3.1035, df = 48, p-value = 0.003203
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 0.3661298 1.7133448
sample estimates:
 mean in group No mean in group Yes 
         8.320690          7.280952 

We reject the null hypothesis that the mean number of hours of sleep per night is the same for those with a TV in their room and those without. The mean number of hours of sleep per night is significantly lower for those with a TV in their room (mean = 7.28) compared to those without a TV in their room (mean = 8.32) with p = 0.003.

We conclude that there is enough evidence in the data at the \(\alpha=0.05\) level of significance to state that there is a difference between the sleep duration of teenagers with and without a TV in their bedroom.
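As a check on where the test statistic comes from, the equal-variance t statistic is \(t = (\bar{x}_1 - \bar{x}_2) / (s_p \sqrt{1/n_1 + 1/n_2})\), where \(s_p^2\) is the pooled variance. The sketch below recomputes it from the rounded group summaries reported by describeBy earlier. Because those inputs are rounded, the result only approximately matches the t.test output (t close to 3.1 on 48 degrees of freedom).

```r
# Group summaries copied from the describeBy output (rounded to 2 decimals)
n_no  <- 29; mean_no  <- 8.32; sd_no  <- 1.16  # no TV in bedroom
n_yes <- 21; mean_yes <- 7.28; sd_yes <- 1.18  # TV in bedroom

# Pooled variance: group variances weighted by their degrees of freedom
deg_free   <- n_no + n_yes - 2
pooled_var <- ((n_no - 1) * sd_no^2 + (n_yes - 1) * sd_yes^2) / deg_free

# Standard error of the difference in means, then the t statistic
se_diff <- sqrt(pooled_var * (1 / n_no + 1 / n_yes))
t_stat  <- (mean_no - mean_yes) / se_diff

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = deg_free)

round(c(t = t_stat, df = deg_free, p = p_value), 4)
```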

Is there a difference in average sleep duration between those who do and do not use a smartphone before bed?

t.test(
  formula = sleep ~ smartphone,
  data = sleep,
  alternative = "two.sided", 
  var.equal = TRUE 
)

    Two Sample t-test

data:  sleep by smartphone
t = 1.4354, df = 48, p-value = 0.1577
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -0.2126135  1.2737246
sample estimates:
 mean in group No mean in group Yes 
         8.075000          7.544444 

We fail to reject the null hypothesis that the mean number of hours of sleep per night is the same for those who use a smartphone before bed and those who do not. The mean number of hours of sleep per night is not significantly lower for those who use a smartphone before bed (mean = 7.54) compared to those who do not (mean = 8.08), with p = 0.158.

We conclude that there is not enough evidence in the data at the \(\alpha=0.05\) level of significance to state that there is a difference between the sleep duration of teenagers who do and don’t use a smartphone before bed.
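The 95 percent confidence interval in the t.test output tells the same story: it contains 0, so a difference of zero is compatible with the data. The sketch below rebuilds that interval from the rounded group summaries reported by describeBy. Since the inputs are rounded, the endpoints agree with the t.test output only approximately.

```r
# Group summaries copied from the describeBy output (rounded to 2 decimals)
n_no  <- 32; mean_no  <- 8.07; sd_no  <- 1.25  # no smartphone before bed
n_yes <- 18; mean_yes <- 7.54; sd_yes <- 1.26  # smartphone before bed

# Pooled variance and standard error of the difference in means
deg_free   <- n_no + n_yes - 2
pooled_var <- ((n_no - 1) * sd_no^2 + (n_yes - 1) * sd_yes^2) / deg_free
se_diff    <- sqrt(pooled_var * (1 / n_no + 1 / n_yes))

# 95% interval: difference plus or minus the critical t value times the SE
t_crit <- qt(0.975, df = deg_free)
ci     <- (mean_no - mean_yes) + c(-1, 1) * t_crit * se_diff

round(ci, 2)  # lower bound below 0, upper bound above 0
```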